Abstract

Metagenomic sequencing provides a unique opportunity to explore earth's limitless environments harboring scores 
of yet unknown and mostly unculturable microbes and other organisms. Functional analysis of the metagenomic 
data plays a central role in projects aiming to explore the most essential questions in microbiology, namely 'In a 
given environment, among the microbes present, what are they doing, and how are they doing it?' Toward this 
goal, several large-scale metagenomic projects have recently been conducted or are currently underway. 
Functional analysis of metagenomic data mainly suffers from the vast amount of data generated in these projects. 
The sheer amount of data requires much computational time and storage space. These problems are compounded 
by other factors potentially affecting the functional analysis, including sample preparation, sequencing method and 
average genome size of the metagenomic samples. In addition, the read-lengths generated during sequencing 
influence sequence assembly, gene prediction and subsequently the functional analysis. The level of confidence for 
functional predictions increases with increasing read-length. Usually, the most reliable functional annotations for 
metagenomic sequences are achieved using homology-based approaches against publicly available reference sequence 
databases. Here, we present an overview of the current state of functional analysis of metagenomic sequence 
data, bottlenecks frequently encountered and possible solutions in light of currently available resources and tools. 
Finally, we provide some examples of applications from recent metagenomic studies which have been successfully 
conducted in spite of the known difficulties. 

Keywords: functional annotation; metagenomics; bioinformatics; next-generation sequencing; pathway-mapping; 
comparative analysis 



INTRODUCTION 

The microbial world shows vast diversity, and mi- 
crobes inhabit almost every niche on the planet. 
Many of them have been shown to be important 
members of their given ecosystems and to play crucial 
roles in various environmental and host-associated 
biological processes. However, due to their general 
unculturability (it is believed that only a small percent- 
age of bacteria in nature can be cultured [1]), up until 
just a few years ago it was practically impossible to 
sequence and analyze them in greater detail. As a 
result, a large fraction of microbes still remain poorly 



characterized and unstudied; and the means by which 
they exert beneficial or other effects in different 
environments remain largely unknown. 

The recent culture-independent technology to 
study microbes inhabiting different environments, 
termed metagenomics [2], has opened new avenues 
for answering questions commonly asked in micro- 
biology, such as 'Which species inhabit a given 
environment?' and 'What are these microbes doing 
and how are they doing it?' The basic steps involved 
in a typical metagenomic project to estimate the 
number of species and the functional repertoire of 
Figure 1: Flow chart for the analysis of a metagenome from sequencing to functional annotation. Only the basic 
flow of data is shown up to the gene prediction step. For the context-based annotation approach, only the gene 
neighborhood method has been implemented thus far on metagenomic data sets, although in principle, other 
approaches which have been used for whole genome analysis can also be implemented and tested. *: A list of tools 
commonly used for these processes is provided in Table 1. Table 3 provides a list of some of the additional functional 
analyses that can be performed on the metagenomic sequences. 



an environment include DNA or RNA sequencing 
using next-generation sequencers (such as Illumina 
and Roche 454), sequence assembly, gene predic- 
tion, functional and metabolic analysis, taxonomic 
binning and comparative analysis of the sequence 
data using specialized bioinformatics methods and 
tools (Figure 1, Tables 1 and 2). However, each 
stage of the analysis suffers heavily due to inherent 
problems of the metagenomic data generated, 
including incomplete coverage, massive volumes of 



raw sequence data produced by the next-generation 
sequencers, generally short read-lengths, species 
abundance and diversity and so on [3, 4]. 

These problems also adversely affect the down- 
stream functional analysis process. For example, due 
to shorter read-length the overall functional compos- 
ition is comparatively poor for shorter pyrosequen- 
cing- or Illumina-sequencing derived reads than for 
longer Sanger reads [35]. Additionally, for very 
complex communities, partial or poor assemblies are 
Table 1: List of commonly used tools for sequence assembly, protein coding gene prediction, RNA gene 
prediction and phylogenetic classification steps of metagenomic data analysis 

Process               Tools                            URL / References
--------------------  -------------------------------  -----------------------------------------
Sequence assembly     Phrap                            http://www.phrap.org/
                      Forge                            http://combiol.org/forge/
                      Arachne                          [5]
                      JAZZ                             [6]
                      Celera                           [7]
                      Velvet
                      Newbler                          454 Life Sciences
                      SOAPdenovo                       [9]
                      EULER                            [10]
                      ORFome assembly                  [11]
                      IDBA-UD                          [12]
Gene prediction       Metagene                         [13]
                      GeneMark                         [14]
                      ORF-Finder                       http://www.ncbi.nlm.nih.gov/projects/gorf/
                      FragGeneScan                     [15]
                      fgenesB                          http://www.softberry.com
                      GLIMMER                          [16]
                      BLAST                            [17]
RNA gene prediction                                    [18]
                      Similarity-based searches for
                      rRNA in reference databases
Taxonomic binning     MetaBin                          [19]
                      MEGAN                            [20]
                      WebCARMA                         [21]
                      PhyloPythia                      [22]
                      TETRA                            [23]
                      NBC                              [24]
                      TACOA                            [25]

obtained due to incomplete coverage, resulting in 
many short contigs and unassembled sequences. This 
leads to the prediction of a large number of small, 
fragmented genes which may not exhibit any matches 
in the reference sequence databases, or match with 
very low significance [36]. Although sequence assem- 
bly and gene prediction tools specifically developed 
for metagenomic data sets offer some advantages over 
similar tools developed for more complete genome 
sequences, surprisingly, no such 'metagenome spe- 
cific' tools have yet been developed for functional 
analysis. Thus, appropriate tools, from the current 
repertoire, and parameters must be used to achieve 
comprehensive and biologically meaningful func- 
tional analysis of metagenomic data sets. The steps 
for sequence assembly and gene prediction of 
metagenomic data sets are compared in several 
recent comprehensive reviews [3, 4, 37, 38]. 



The scope of this review is to comprehensively 
discuss the prime objectives, methods and problems 
for functional and metabolic analysis of metagenomic 
sequence data, and to propose some solutions for the 
latter. Toward this, we first try to familiarize the 
reader with the aims of functional metagenomic ana- 
lysis and the most commonly adopted publicly avail- 
able tools and resources to achieve them. Next, we 
discuss how the problems arising from metagenomic 
sequencing affect this process, and we suggest various 
strategies for addressing some of these issues under 
the present scenario. Lastly, we demonstrate that, 
despite these issues, metagenomic functional analysis 
can still be reliably used to address globally important 
environmental and biological questions. 

OBJECTIVES OF FUNCTIONAL 
METAGENOMIC ANALYSIS 
STUDIES 

Interestingly, the same microbial communities 
sampled at different times or from different hosts 
can vary significantly. For example, the gut micro- 
biomes of 13 healthy Japanese individuals were quite 
different, yet they still shared many microbes [39]. 
Also, the community members for any given 
environment commonly play different roles. For ex- 
ample, in the human gut microbiome, segmented 
filamentous bacteria are known to play important 
roles in maintaining intestinal immunity [40, 41], 
whereas bifidobacteria are known to utilize complex 
carbohydrates and thereby exert beneficial effects on 
human health [42]. Thus, there are two broad 
objectives of the functional analysis for metagenomic 
studies: the first is to determine the 
functional and metabolic repertoires of the different 
community members that enable them to exert 
different effects, and the second is to identify the 
variations, if any, within the functional compositions 
of the different communities, e.g. those found 
between healthy and diseased individuals that may 
be related to the cause of the disease. To determine 
the functional content of the member species of a 
microbiome, the coding and functional capacity for 
all (or at least the dominant) members should be 
comprehensively analyzed. Alternatively, if the goal 
of the study is to analyze and contrast the functional 
and metabolic capacities of different communities, 
then the functional and metabolic pathway profiles 
for the communities need to be generated and 
compared. 
PUBLICLY AVAILABLE RESOURCES 
AND TOOLS FOR FUNCTIONAL 
ANNOTATION OF 
METAGENOMIC DATA 

Dedicated tools for functional annotation and 
analysis of metagenomic data sets lag far behind the 
rate at which the data is being generated. Recently, 
some web-based, as well as local-use based, pipelines 
have been developed for the analysis of metagenomic 
data sets. Table 2 provides a list of a few well-known 
representative pipelines and compares the functional 
analysis capacity of each. Almost all of these pipelines 
provide integrated platforms for the functional pre- 
diction of metagenomic sequences using multiple 
tools and databases, which are also commonly used 
for the analysis of whole genome sequences. Most of 
the pipelines offer sufficient resources for the func- 
tional analysis of user data. However, to account for 
the inherent problems associated with the metage- 
nomic data sets, it is highly recommended to evalu- 
ate the computational workflow and parameters for 
any given project. This can be achieved by using 
simulated sequencing reads generated by MetaSim 
[43], to assess and compare different tools before 
actually using them on full data sets. The analysis 
time of any pipeline typically depends on the size 
of the data sets and, in the case of web-based servers, 
the load of requests that are already in progress sub- 
mitted by other users. Web-based servers such as 
CAMERA [28], MG-RAST [30] and IMG/M [26] 
host pre-computed results for most published meta- 
genomes that enable users to perform comparative 
analysis with their own data sets. In most cases, the 
computed data can be visualized in the form of 
simple plots. However, KEGG [44] pathway maps 
and abundance profiles can also be obtained using 
the IMG/M and MG-RAST servers. 



STRATEGIES COMMONLY 
ADOPTED BY THE PIPELINES FOR 
THE FUNCTIONAL ANALYSIS OF 
METAGENOMIC DATA 

Protein function is a very broad term, as function can 
be predicted at several different levels. For example, 
the Gene Ontology database [45] adopts three broad 
domains for classifying gene products viz., the cellu- 
lar location of the protein, the overall biological 
process it takes part in and the molecular function 
of the protein. On the other hand, the subsystem- 
based classification approach adopted by the SEED 



database [46] relies mainly on the grouping of func- 
tional roles into subsystems by curation experts. The 
defined subsystems may be thought of as a general- 
ization of the term 'pathway'. Similarly, the KEGG 
database [44] is a resource of pathway maps built 
from both genomic and chemical information of 
the biological systems. However, such specific func- 
tional assignment may be lacking for completely 
novel proteins or for those which share very weak 
homology with known proteins, both of which are 
ample in metagenomic data sets. For such proteins, 
even minimal information that can be extracted 
related to their function can be useful, and may be 
the only available clues to their function. 

As shown in Figure 1 and Table 2, the basic tools 
that are implemented in almost all of the available 
pipelines for functional analysis of metagenomic data 
are the same as those which are commonly used for 
whole genome studies and are well known. However, 
their performance in the metagenomic context has 
yet to be evaluated and reviewed. Thus, in the current 
review, we have divided these tools into four categories 
based on their inherent approach. In the following 
sections, we review each approach in the context 
of its application to metagenomic data analysis, keeping 
in mind the associated problems of the data itself. 
Homology-based approach 

As shown in Table 2, the 'simplest' and most common 
approach adopted by all of the available pipelines for 
functional prediction is by comparison of the 
predicted query proteins to existing resources of ref- 
erence protein sequences, including NCBI NR [47], 
SMART [48] and UniProt/UniRef [49]. The IMG/ 
M [26] and MG-RAST [30] servers also search the 
publicly available metagenomic data sets for homologs 
of the query sequences. The databases of clusters of 
orthologous groups (COGs) [50], non-supervised 
orthologous groups (NOGs) [51], protein families 
and domains including Pfam [52] and TIGRFAM 
[53], etc. are used by several pipelines to infer func- 
tional categories or to identify families and domains 
embedded in the query proteins. In some cases, 
similarities to genes found in the GO database are 
further explored to infer hierarchical annotations. 
Pathway and subsystem information for the query 
proteins is inferred by searching for homologs in the 
KEGG and SEED databases, respectively, by almost 
all of the pipelines. 

For these searches, different variants of BLAST 
[17] are the most preferred algorithms, including 
BLASTX, BLASTP, RPS-BLAST, etc. For less 
sensitive, but faster, searches BLAT [54] may also 
be used, as in the case of the MG-RAST server. 
Additionally, more sensitive profile- and pattern- 
based search methods are used by almost all of the 
pipelines in which sequence profiles generated from 
alignments of protein families in Pfam or TIGRFAM 
databases are searched using the hidden Markov 
model-based algorithm, HMMER [55]. For all 
these methods, best hits are identified based on 
statistical calculations and annotation information is 
directly applied to the query proteins. 
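
The best-hit step described above can be illustrated with a minimal sketch. It assumes that BLAST was run with tabular output (`-outfmt 6`), whose default columns end with the E-value and the bit score; the function name, the 1e-5 cut-off and the example identifiers are illustrative assumptions and are not taken from any of the pipelines discussed here.

```python
from typing import Dict, Iterable, Tuple


def best_hits(lines: Iterable[str],
              max_evalue: float = 1e-5) -> Dict[str, Tuple[str, float, float]]:
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Expects BLAST tabular lines (12 default columns of -outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. The cut-off is an assumed, illustrative value.
    """
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is not significant enough to transfer annotation
        # keep the hit with the highest bit score for each query
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example records (not real data)
example = [
    "read1\tprotA\t95.0\t60\t3\t0\t1\t180\t10\t69\t1e-30\t120.0",
    "read1\tprotB\t80.0\t60\t12\t0\t1\t180\t5\t64\t1e-12\t70.5",
    "read2\tprotC\t40.0\t30\t18\t0\t1\t90\t1\t30\t0.5\t25.0",
]
hits = best_hits(example)
```

In this sketch a query with no hit below the cut-off (here `read2`) simply remains unannotated, which mirrors the situation described for short or novel metagenomic sequences.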

Homology-based approaches mainly suffer from 
the long computation time required to search for 
homologs for each of the sequences within the typ- 
ically massive metagenomic data sets. Additionally, 
BLAST-based functional predictions have been esti- 
mated to include 13-15% database propagation 
errors [56]. Moreover, to detect a true match, the 
reference database being searched needs to contain at 



least one homolog of the query sequence. And, the 
fragmentary nature of the shotgun-generated meta- 
genomic data leading to partial proteins negatively 
impacts homology-based function prediction. This 
is discussed in more detail below. 

The extent to which metagenomic functional 
annotation has been achieved using different databases 
is demonstrated in Figures 2 and 3. The highest frac- 
tion of metagenomic sequences were annotated using 
the NCBI RefSeq database, which is a comprehensive 
collection of non-redundant well-annotated protein 
sequences. On the other hand, only a small fraction of 
sequences could be annotated using the Swiss-Prot 
database, which harbors manually annotated and 
reviewed protein sequences. The number of proteins 
annotated using the COGs database was slightly less 
than RefSeq. Among the protein family and profile 
databases, more predictions were made using Pfam as 
compared to the TIGRFAM database. This could 
mainly be due to the great number of protein families 
Figure 2: Distribution of metagenomic sequence matches in the SwissProt, RefSeq, KEGG and SEED databases at 
various E-value cut-offs. Smaller sequences match at lower confidence (higher E-values; lighter colors) or do not 
match at all in the databases. More sequences match with higher confidence (lower E-values; darker colors) as the 
sequence length used for the analysis increases. Pre-computed data for the metagenomes shown was derived from 
the MG-RAST server. 
that are included in the Pfam database (13 672 in 
Pfam 26.0 release) than in the TIGRFAM database 
(4209 in TIGRFAM 12.0 release). The annotation 
using KEGG metabolic pathways is relatively low 
mainly due to the inherent problems of the metage- 
nomic data sets, as discussed below. The SEED system 
of classification performs similarly to that of KEGG, 
although the number of predictions is slightly lower. 

Motif- or pattern-based approach 

The partial proteins generated from short contigs and 
unassembled sequences which arise due to short 
read-lengths or complex environments generally 
exhibit very poor similarities using homology-based 



approaches (Figure 2). Additionally, some proteins, 
despite sharing a common function, are more diverse 
at the sequence level. The overall sequence similarity 
of such proteins is usually lower than the thresholds 
used for homology-based functional prediction; how- 
ever, they still share one or more common sequence 
or structural patterns or motifs necessary to maintain 
their structure and function. Currently, databases like 
PROSITE [64] and PRINTS [65] present a reliable 
repository of such patterns or motifs against which the 
query metagenomic sequences may be searched either 
independently or through the integrated InterPro 
database [66]. Currently, only the IMG/M server in- 
corporates the InterPro database. However, a general 
Figure 3: Status of functional prediction of protein-coding genes from different metagenomic data sets and 
representatives of completely sequenced genomes. The overall functional prediction bars represent the fraction of 
protein-coding genes that map to at least one of the four databases: clusters of orthologous groups 
(COGs), Pfam, TIGRFAM and KEGG pathways. For comparative purposes, the functional annotation status for the 
well-studied model microbial genome, E. coli K12-W3110, the smallest microbial genome, M. genitalium, and the 
human genome are also shown. The data for this graph were derived from the IMG/M database. It should be noted 
that for uniform comparison, the prokaryotic COGs version was also used for Homo sapiens. The number of matches 
to eukaryotic COGs (KOG database [57]) may be higher for H. sapiens. The numbers next to the bars represent 
the total number of predicted protein-coding genes in each data set using the IMG/M annotation pipeline. For the 
Sludge [58] community, data from only the Phrap assembly, a widely used program for DNA sequence assembly, 
was used. Except for the Cow Rumen Viral community [59], which was sequenced using the 454 platform (average 
read-length > 300 bp), all other metagenomes were sequenced using the Sanger method (average read-length 
~1000 bp). The following additional data sets were used: Ocean [60], Soil [61], Acid Mine Drainage [62], 
Human Gut [63]. 
problem with motif-based annotation is that short 
sequence matches typically show low statistical signifi- 
cance and false-positive rates can be high [67]. 
Nevertheless, given the amount of novelty inherent 
in metagenomic data sets, it is recommended to run 
motif-based analysis in parallel with other functional 
prediction approaches. 
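
To make the idea of a motif search concrete, the following minimal sketch translates a pattern written in PROSITE syntax into a regular expression and scans a protein fragment with it. The translation covers only the common syntax elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors); the function name and the example fragment are hypothetical, and the pattern used is the well-known P-loop (ATP/GTP-binding site motif A) pattern. It is an illustration of the principle, not the scanning procedure used by PROSITE or InterPro.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: handles x, [ABC], {ABC}, (n) / (n,m) repeats and
    the < and > anchors at the pattern ends; rarer syntax is not covered.
    """
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # separate the residue specification from an optional repeat count
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except those listed
        # [..] residue sets and single letters are already valid regex
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(("^" if anchor_start else "") + core +
                     ("$" if anchor_end else ""))
    return "".join(parts)


p_loop = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
fragment = "MKLGPPGSGKSTLA"  # hypothetical partial protein
hit = re.search(p_loop, fragment)
```

Because such patterns are short, a match of this kind carries little statistical weight on its own, which is the false-positive concern raised above.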

Context-based annotation 

Metagenomic data sets contain a large number of novel 
sequences which share no homology with known 
sequences and thus remain unannotated by the previ- 
ous two approaches. To overcome these limitations, 
gene context-based approaches may also be used. A 
few examples from single genome annotation projects 
include genomic neighborhood [68, 69], gene fusion 
[70, 71], phylogenetic profiling [72] and gene 
co-expression analysis [73]. Among these, only the 
genomic neighborhood approach has been imple- 
mented in the case of metagenomics. In 2007, 
Harrington et al. [74] applied a combination of homol- 
ogy-based searches and customized gene neighbor- 
hood methods to four metagenomic data sets derived 
from a variety of complex environments. Whereas 
BLAST-based methods alone annotated 70% of the 
sequences, their combined method inferred specific 
functions for 76% and non-specific functions for 83% 
of the sequences. However, due to the paucity of 
complete genomes in metagenomic data sets and the 
lack of knowledge about the true species origin of the 
sequences, this approach has its limitations. These 
problems may be ameliorated by increasing the 
sequencing depth and by improving the taxonomic 
assignment of the sequences. Additionally, better 
assemblies resulting in longer contigs will also improve 
the efficiency of context-based annotation methods. 
Currently, only IMG/M and SmashCommunity [31] 
can be used to view predicted genes in the genomic 
neighborhood context. 
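
The principle behind the genomic neighborhood approach can be sketched as follows. This is a deliberately simplified illustration and not the method of Harrington et al. [74]: it assumes that genes are listed in their order along a contig, and it proposes, for each unannotated gene, the most frequent annotation among nearby genes on the same strand. The data structure, window size and example annotations are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# (gene_id, strand, annotation or None), listed in order along one contig
Gene = Tuple[str, str, Optional[str]]


def neighborhood_hints(contig: List[Gene], window: int = 2) -> Dict[str, str]:
    """Suggest a contextual function for each unannotated gene.

    A hint is the most common annotation among annotated genes within
    `window` positions on the same contig and the same strand.
    """
    hints: Dict[str, str] = {}
    for i, (gene_id, strand, annotation) in enumerate(contig):
        if annotation is not None:
            continue  # already annotated by homology or motif searches
        low, high = max(0, i - window), min(len(contig), i + window + 1)
        votes = Counter(
            ann
            for j, (_, other_strand, ann) in enumerate(contig[low:high], start=low)
            if j != i and other_strand == strand and ann is not None
        )
        if votes:
            hints[gene_id] = votes.most_common(1)[0][0]
    return hints


# Hypothetical contig with two unannotated genes
contig = [
    ("g1", "+", "histidine biosynthesis"),
    ("g2", "+", None),
    ("g3", "+", "histidine biosynthesis"),
    ("g4", "-", "transposase"),
    ("g5", "-", None),
]
hints = neighborhood_hints(contig)
```

The sketch also shows why short contigs limit this approach: a gene on a singleton read has no neighbors from which a hint could be drawn.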

Other types of functional prediction 

Lastly, the putative roles of the metagenomic sequences 
can also be inferred by running more specific analyses 
using dedicated tools that target prediction of carbohy- 
drate active enzymes, glycosyl hydrolases, protein 
localizations, lipoproteins, adhesins, secretory proteins, 
transporters, CRISPRs (Clustered Regularly Inter- 
spaced Short Palindromic Repeats), insertion 
sequences, virulence factors, etc. A list of a few repre- 
sentative tools for such analysis is given in Table 3. It 
should be noted that the list is not comprehensive, and 



that a discussion about all the tools for the above- 
mentioned purpose is beyond the scope of this review. 

GENE-CENTRIC ANALYSIS OF 
METAGENOMIC DATA SETS 

To explore the effect of environment on the functional 
and metabolic contents of different communities, 
comparative functional analysis may be performed on 
the total gene-content of the communities, i.e. 
gene-centric analysis. For this purpose, functional 
profiles can be compared and contrasted across differ- 
ent metagenomic data sets to look for functional 
characteristics responsible for community differences. 
Normally two levels of comparison are performed, 
viz., comparison of abundance of functional families 
and pathways, and estimation of statistical parameters 
to ensure that the observed differences in abundance 
are not merely chance occurrences. Different types of 
abundance profiles may be generated and compared 
using, for example, COGs functional categories, 
Pfam functional families, KEGG metabolic pathways, 
or SEEDs subsystems. However, before comparing the 
metagenomes, proper normalizations of the data sets 
should be performed to account for the data-associated 
problems, such as partial genes and effective genome 
sizes (discussed later). Heat-maps are commonly used 
to visualize the differences in communities with respect 
to the above-mentioned functional or metabolic 
profiles (for example [60, 61, 76-78]). In addition, 
statistical methods, such as principal component ana- 
lysis (PCA) and multidimensional scaling (MDS), may 
be used to reveal which factors most affect the observed 
data (for example [79, 80]). The common approaches 
and limitations of the gene-centric analysis are 
discussed and reviewed by Kunin et al. [3]. 
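
The two levels of comparison described above can be sketched in a few lines. The example below normalizes raw category counts to relative abundances and then applies a two-proportion z-test to ask whether a difference in abundance is likely to be a chance occurrence. It is a minimal sketch under stated assumptions: normalization is by the total number of annotated hits only (corrections for partial genes and effective genome size are not included), no multiple-testing correction is applied, and the category names and counts are invented for illustration.

```python
import math
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw hit counts per functional category into fractions."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # category absent from (or saturating) both samples
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts of annotated genes per category in two metagenomes
sample_a = {"Carbohydrate metabolism": 300, "Motility": 50, "Translation": 650}
sample_b = {"Carbohydrate metabolism": 150, "Motility": 200, "Translation": 650}

profile_a = relative_abundance(sample_a)
profile_b = relative_abundance(sample_b)
total_a, total_b = sum(sample_a.values()), sum(sample_b.values())
p_values = {
    category: two_proportion_p(sample_a[category], total_a,
                               sample_b[category], total_b)
    for category in sample_a
}
```

The resulting relative-abundance profiles are the kind of input typically passed on to heat-map visualization or to ordination methods such as PCA and MDS.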

PROBLEMS ASSOCIATED WITH 
FUNCTIONAL ANALYSIS OF 
METAGENOMIC DATA 

The analysis and annotation of metagenomic data sets 
differ from that of whole genome studies mainly 
because the former is a complex mixture of sequences 
from multiple species. Even draft quality bacterial 
whole genome sequences represent most of the 
chromosomes, except for a few of the more complex 
regions that include repeats, insertion sequences, 
tRNAs, rRNAs, etc. When sequence coverage is suf- 
ficient, the assemblies obtained usually result in very 
long contigs with few gaps. The efficiency of gene 
Table 3: List of commonly used available resources for functional analysis (other than homology-, motif- and 
context-based) that can be performed on metagenomic data sets 

Type of prediction                          Resource name           URL
------------------------------------------  ----------------------  ----------------------------------------------------------------------
Carbohydrate-active enzymes                 CAZy                    http://www.cazy.org/
Glycosyl hydrolases                         GAS                     http://csbl.bmb.uga.edu/~ffzhou/GASdb/
Protein localization                        PSORT                   http://psort.hgc.jp/
                                            Cell-PLoc               http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc/
                                            CELLO                   http://cello.life.nctu.edu.tw/
                                            PA-SUB                  http://webdocs.cs.ualberta.ca/~bioinfo/PA/Sub/index.html
Membrane proteins                           DAS                     http://www.sbc.su.se/~miklos/DAS/
                                            HMMTOP                  http://www.enzim.hu/hmmtop/html/submit.html
                                            HMM-TM                  http://bioinformatics.biol.uoa.gr/HMM-TM/index.jsp
                                            TMB-Comp                http://bmbpcu36.leeds.ac.uk/~andy/betaBarrel/TMB.Hunt2/TMB.Comp.cgi
Lipoproteins                                DOLOP                   http://www.mrc-lmb.cam.ac.uk/genomes/dolop/dolop.htm
                                            LIPO                    http://services.cbu.uib.no/tools/lipo
                                            SignalP                 http://www.cbs.dtu.dk/services/SignalP/
                                            LipoP                   http://www.cbs.dtu.dk/services/LipoP/
                                            PRED-LIPO               http://bioinformatics.biol.uoa.gr/PRED-LIPO/input.jsp
Secretory proteins (signal peptide Type 1)  Tatfind                 http://signalfind.org/tatfind.html
                                            TatP                    http://www.cbs.dtu.dk/services/TatP/
                                            SignalP                 http://www.cbs.dtu.dk/services/SignalP/
                                            PrediSi                 http://www.predisi.de/index.html
Adhesins                                    SPAAN                   Sachdeva et al. 2004 [75]
Transporters                                TransportTP             http://bioinfo3.noble.org/transporter/
                                            TransAAP                http://www.membranetransport.org/transaap/TransAAP_login.html
                                            TCDB                    http://www.tcdb.org/
Insertion sequences                         ISsaga                  http://issaga.biotoul.fr/ISsaga/issaga.index.php
CRISPRs                                     PILER                   http://www.drive5.com/pilercr/
                                            CRISPRfinder            http://crispr.u-psud.fr/Server/
Repeats                                     Tandem Repeats Finder   http://tandem.bu.edu/trf/trf.html
                                            EMBOSS                  http://emboss.sourceforge.net/
Virulence factors                           VFDB                    http://www.mgc.ac.cn/VFs/
                                            MvirDB                  http://predictioncenter.llnl.gov/



prediction algorithms on such long contigs is quite high 
and most of the full-length coding DNA sequences 
(CDSs) can be predicted with high confidence. 
Functional prediction analysis can next be applied to 
obtain the functional repertoire of the genome. The 
functionally annotated CDSs can then be viewed in the 
context of metabolic pathways to predict the metabolic 
capabilities of the species under study. 

A metagenome can be viewed as a collection of 
several whole genomes. To fully understand an en- 
vironment, in principle, draft quality whole genome 
sequences for every member should be achieved by 
complete DNA sequencing. However, in spite of the 
availability of high throughput second-generation 
sequencers, this is still a very expensive and daunting 
task. What can be best captured from a metagenomic 
sample is a mixture of fragmented sequences from 
the community members, and mostly from domin- 
ant members of the environment. When the sequen- 
cing depth is sufficient, and by the use of sequence 
assemblers developed specifically for metagenomic 



data (Table 1), draft quality assemblies for some of 
the member species may be achieved; e.g. a draft 
methanogen genome was recently assembled from 
a permafrost microbial community [78]. However, 
this still did not suffice for completely understanding 
the environment, as the assemblies for many other 
members remained poor due to the inherent com- 
plexity of the environments and lower sequencing 
coverage for these genomes. Thus, for most metage- 
nomic studies, we are left with only enormous 
volumes of fragmented sequences (comprised of a 
mixture of short contigs and singletons) from mul- 
tiple species to perform analysis on. In the case of 
contigs, gene predictions will be more accurate, 
whereas the predicted genes from singletons will 
almost always be partial in spite of using gene 
prediction tools specifically developed for metage- 
nomic data (Table 1), unless very long read-lengths 
were obtained during sequencing. This is mainly 
because the typical average read-lengths generated 
by next-generation sequencers providing deeper 
coverage, including Illumina, are still smaller (up to 
300 bp for paired-end reads) than the average size of 
the typical prokaryotic protein coding gene 
(~1000 bp [81]). The 454 pyrosequencing platform 
can be an alternative technology due to the longer 
average read-lengths it can generate (up to 700 bp for 
the 454 GS FLX+ pyrosequencer, 
http://454.com/downloads/GSFLXApplicationFlyer_FINALv2.pdf), 
but it is not the preferred choice mainly due to its 
lower coverage and higher cost as compared to 
Illumina sequencing. 

To obtain the most complete information of the 
functional repertoire for any metagenome it is recom- 
mended to use the genes predicted from both the 
contigs and the singletons, even though many of the 
predicted CDSs are partial. In general, short query 
lengths negatively impact homology-based functional 
prediction as they may decrease the significance of 
pairwise similarities due to added noise. This is clearly 
evident from Figure 2, which shows that there are no 
matches for sequences of length ~100 bp for the 
'Cow Rumen' metagenome [79] in the lower and 
more significant E-value bins (E-value < 1e-10). 
On the other hand, as sequence length increases, the 
E-value bins with lower values become more popu- 
lated, as in the case of the 'Human Gut Japanese' [39] 
data set. Additionally, for short sequence lengths, 
homology-based approaches have limited sensitivity. 
For example, only ~25% of the 'Cow Rumen' se- 
quences could be annotated using GenBank, whereas 
>75% of the 'Human Gut Japanese' sequences could 
be annotated using the same database with the same 
parameters (Figure 2). These problems may be ame- 
liorated to some extent by increasing sequencing 
depth or read-length so that better assemblies and 
gene predictions can be obtained. 
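
One way to see why short sequences rarely reach the more significant E-value bins is the standard BLAST relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n is the database length and S' is the bit score. The sketch below evaluates this relation for fragments of different lengths. The database size and the score of one bit per aligned residue are illustrative assumptions, and the effective-length corrections applied by BLAST are ignored.

```python
import math


def evalue(bit_score: float, query_len: float, db_len: float) -> float:
    """Expected number of chance hits: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def required_bit_score(query_len: float, db_len: float,
                       max_evalue: float) -> float:
    """Bit score S' needed for a hit to reach E <= max_evalue."""
    return math.log2(query_len * db_len / max_evalue)


DB_LEN = 1e9            # assumed database size in residues (illustrative)
BITS_PER_RESIDUE = 1.0  # assumed average score per aligned residue (illustrative)

# roughly 100, 300 and 1000 bp after translation into protein
for residues in (33, 100, 333):
    score = BITS_PER_RESIDUE * residues
    print(residues,
          f"E={evalue(score, residues, DB_LEN):.2e}",
          f"needed bits={required_bit_score(residues, DB_LEN, 1e-10):.1f}")
```

Under these assumptions the score needed to reach a given E-value grows only logarithmically with query length, whereas the score that a fragment can attain is limited by the number of residues it can align, so the shortest fragments fall short of strict cut-offs even when they align along their full length.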

Another problem in metagenomic functional ana- 
lysis stems from the lack of knowledge of the species of 
origin of the sequences. Although phylogenetic clas- 
sification and binning methods specific to metage- 
nomic sequences may be able to classify 40-93% of 
the reads [19] at the genus level, depending on the 
novelty of the data set, at the species level this per- 
centage is expected to decrease. This indicates that at 
least 7-60% of the sequences still remain unclassified 
due to the limitations of the available tools and the 
paucity of reference genomes in the public databases. 
Thus, in spite of gaining some functional information, 
due to the absence of specific species information, it is 
extremely difficult to put together many functionally 
annotated metagenomic sequences in context of their 



actual metabolic pathways. Additionally, because 
most of the metagenomic sequences will be derived 
from the dominant species, the complete functional 
and metabolic repertoire of the less abundant 
members cannot be obtained. Other techniques com- 
plementary to metagenomics, such as single cell gen- 
omics [82], may help in overcoming this problem by 
providing access to the genomic DNA from uncultur- 
able microbes. However, even single cell genomics 
has many challenges remaining [82]. Nevertheless, if 
the objective of the metagenomic study is to only 
analyze the overall metabolic capacity of the entire 
community, then putting the sequences in context 
of their individual genomes of origin may not pose a 
serious problem. 

Given that metagenomic studies are aimed at 
exploring complex environments harboring many 
yet uncultured and unknown microbes, the data sets 
are expected to possess a large number of novel se- 
quences. As shown in Figure 3, the overall functional 
annotation achieved in the case of some example bac- 
terial metagenomes is 50-75%, with the remaining 
sequences being unannotated. Even for 'complete' 
genomes, functional annotation is not complete. In 
the most studied model organism, Escherichia coli 
K12-W3110, and the smallest studied genome,
Mycoplasma genitalium, both of which are considered
'simpler' systems, the overall functional annotation
remains ~90%. And, in a more complex system viz.,
the human genome, only ~82% of the predicted 
proteins are currently annotated. For the even more 
complex human gut metagenome, this number 
decreases to ~75%. Interestingly, while ocean and 
soil are also considered as 'complex metagenomes' 
on the scale of the human gut microbiome, only 
~50-55% of the sequences in these communities
can be annotated. This difference in level of annota- 
tion could be due to a bias in the number of 
human-associated microbial genomes that have thus 
far been sequenced and are included in the reference 
sequence databases. To deal with the novelty of meta- 
genomic data, reference genome sequencing efforts 
should be initiated for other environments as has 
been done under the Human Microbiome Project 
[83], which plans to sequence a large number of 
reference genomes from different body sites for the 
human microbiome. 

While the functional annotation of bacterial 
metagenomes is at a reasonable level and is gradually 
improving, the situation for viral metagenomes, or 
viromes, lags far behind. The extent of virome 
annotation for cow rumen [59] and human lung [80]
drops to as low as 13-15% (Figure 4) in comparison to 
bacterial annotation (cow rumen: 32%) for similar 
environments. The average metagenomic read-length 
used for the human lung virome was only 84 bp. One 
might argue that this reduction in the percentage of 
functional annotation may be due to the short 
read-length, which is known to affect the extent 
and confidence level of the functional prediction 
process, as discussed earlier. But, surprisingly, the per- 
centage of functional annotation for the cow rumen 
virome is also low (15%), despite using a longer 
read-length (>300 bp). Thus, this reduction in the
extent of functional prediction for viromes could be 
mainly due to the limited number of completely 
sequenced viral species in the reference databases. 

Figure 4: Status of functional prediction for viral
metagenomes. The bars for the Cow Rumen viral metagenome
data set represent the percentage of genes
predicted from assembled contigs, while those for the
Human Lung viral metagenome data set [80] represent
the percentage of raw reads. Legend: overall functional
prediction; COGs; KEGG pathways. Data sets: Cow Rumen
(454, >300 bp, genes from assembled contigs); Human
Lungs (454, 84 bp, raw reads).

The genome sizes of the individual microbial
members of a community can vary greatly. It is
known that larger genomes harbor a smaller relative
fraction of universal and housekeeping genes, and thus
contain a large number of novel genes [84, 85]. 
Indeed, a weakly significant positive correlation was 
found between the effective genome size and the 
potential for carrying novel genes [86]. Therefore, 
the average genome size in an environmental sample 
could also affect the comparative functional analysis of 
the metagenome. Recently, Beszteri et al. [87]
demonstrated how, among metagenomic samples, the
differences in relative gene abundance, which are 
often used to interpret habitat-specific adaptations, 
are biased by the average genome size of the commu- 
nities sampled. Thus, before arriving at biological 
conclusions from functional analysis of metagenomic 
data sets, the latter should be normalized to account 
for their different average genome sizes. 
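
One simple way to express such a normalization is to convert raw hit counts into copies per genome equivalent. The sketch below is an illustration under stated assumptions and is not the procedure of Beszteri et al. [87]: it assumes that an estimate of the average genome size is already available for each sample, and the function names and numbers are hypothetical.

```python
def genome_equivalents(total_bases, avg_genome_size):
    """Number of average-sized genomes represented by the sequenced bases."""
    if avg_genome_size <= 0:
        raise ValueError("average genome size must be positive")
    return total_bases / avg_genome_size

def normalize_counts(counts, total_bases, avg_genome_size):
    """Convert raw hit counts per gene family into copies per genome equivalent."""
    ge = genome_equivalents(total_bases, avg_genome_size)
    return {family: n / ge for family, n in counts.items()}

# Two hypothetical samples with identical raw counts and sequencing effort
# but different (assumed) average genome sizes.
raw = {"family_A": 100}
sample_small = normalize_counts(raw, total_bases=1_000_000_000, avg_genome_size=2_000_000)
sample_large = normalize_counts(raw, total_bases=1_000_000_000, avg_genome_size=5_000_000)
print(sample_small["family_A"], sample_large["family_A"])
```

In this toy case the same raw count corresponds to 0.2 copies per genome in the small-genome community and 0.5 copies per genome in the large-genome community, which is why raw relative abundances alone can mislead.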

Apart from the aforementioned problems, the analysis
of metagenomic data sets can also be influenced by
the sequencing technology used. For example, 454
pyrosequencing technology produces 11-35%
artificial replicates, both identical reads (duplicates) and
reads that begin at the same position but vary in length
or contain sequencing discrepancies, which lead to
biased functional annotations [88]. Replicates were
also observed in an Illumina-sequenced permafrost
microbial community analysis [78]. Thus, the
metagenomic reads should be de-replicated before in-depth
functional analysis is performed. Both 454 pyrosequencing
and the more recent Ion Torrent sequencing
technologies are known to introduce frameshift errors
in the reads, mostly due to homopolymer runs. Almost
none of the available bioinformatics tools for functional
annotation of metagenomic sequences are capable of
handling such errors, although several specialized tools
for frameshift detection are currently available [89-93]
in the public domain and should be used for more
in-depth functional analysis. In some cases, the protocols
used for sample preparation, particularly the use of
filters or other sample selection methods, can also lead to
inappropriate biological interpretations. For example, in
the first Sargasso Sea data set [94], some nitrogen-fixing
genes were found to be lacking [95]. However, the lack
of these genes was later attributed to the absence of their
main contributors, cyanobacteria, which were likely
removed during the filtration step [96].
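
The de-replication step recommended above can be sketched in a deliberately simplified form. The example treats reads that share the same leading bases as artificial replicates and keeps only the longest read of each group. This is an assumption made for illustration: the prefix length is a hypothetical parameter, sequencing discrepancies within the prefix are not tolerated, and dedicated tools typically rely on more careful clustering. At high coverage, genuinely independent reads may also start at the same position, so such a filter can discard real data.

```python
def dereplicate(reads, prefix_len=20):
    """Keep one read per shared prefix, preferring the longest read.

    reads: iterable of (name, sequence) tuples.
    prefix_len: number of leading bases used to group replicates
                (hypothetical default, chosen for illustration).
    """
    best = {}
    for name, seq in reads:
        key = seq[:prefix_len].upper()
        # Replace the stored read only if the new one is strictly longer.
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (name, seq)
    return list(best.values())

# Toy reads: r1 and r2 are identical, r3 starts at the same position
# but is longer, and r4 is unrelated.
reads = [
    ("r1", "ACGTACGTAA"),
    ("r2", "ACGTACGTAA"),
    ("r3", "ACGTACGTAACC"),
    ("r4", "TTTTGGGGCC"),
]
unique = dereplicate(reads, prefix_len=8)
print(unique)
```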



APPLICATIONS OF METAGENOMIC 
FUNCTIONAL ANALYSIS 

Despite the challenges for metagenomic functional 
analysis, many studies exploring different environments 
are being conducted with varying degrees of success.
The application of metagenomic functional analysis is
an extremely important and versatile subject, and,
given the scope of the current review, it is impossible
to comprehensively discuss it here. Therefore, to
exemplify the successful implementation of metagenomic
functional analysis to address some biologically and
environmentally important issues, a few recent example
studies are presented in the following sections. For a
discussion of other studies of major interest, we
recommend the comprehensive review by Wooley et al. [4].

Comparative metagenomic-based 
studies 

Recently, in a large-scale metagenomic analysis of 
124 European individuals, a catalogue of over 3.3 
million human gut microbial genes was created [97]. 
This led to the identification of bacterial functions 
that are necessary for a bacterium to thrive in the 
gut context, and to those functions involved in 
homeostasis of the entire ecosystem. This catalogue 
not only provides a good resource for annotating new 
human gut-related metagenomes and for comparative 
analysis, it also enables future studies to discover asso- 
ciations between the microbial genes and human 
phenotypes. In another study, the gut metagenomes 
of four healthy individuals were compared to those of 
individuals with autoimmune disorders, including 
type I diabetes [98]. This analysis suggested that 
increased adhesion and flagella synthesis in diseased 
individuals may be involved in triggering the type I
diabetes-associated autoimmune response. Recently, a
comparison between the human gut environment
and the oral cavity was made by comparing the two
metagenomes, and clear distinctions in the functional
capacities of the two niches were observed [99]. In
the same study, another comparison between oral 
metagenomes from supragingival dental plaque and 
cavities of healthy and diseased individuals, respect- 
ively, suggested that the dental plaque of healthy in- 
dividuals (those who have never suffered from caries) 
may be a genetic reservoir for novel anticaries com- 
pounds and probiotics, which are live microorganisms 
thought to be beneficial to the host organism. 

Metagenomics studies to date have not only aimed 
at exploring human health-related issues, but have also 
attempted to address various environmental issues. 
Global warming resulting from the emission of greenhouse
gases is a major concern worldwide. Rising
global temperatures cause permafrost, a vast reservoir
of natural carbon, to thaw, resulting in microbial
degradation of organic matter and emission of more
greenhouse gases. Comparative metagenomics of 
permafrost was recently applied to both the frozen 
and thawed states to analyze the shifts in microbial 
and functional composition [78]. Multiple genes 
involved in carbon and nitrogen cycling were found 
to shift rapidly during thaw. From this study, important 
insights about the microbial species and functional 
components involved in greenhouse gas emissions 
may be obtained. 

Metagenomic data-mining-based studies 

The natural diversity and abundance of metagenomic
data are enormous. Over 300 independent metagenomic
projects have already been completed or are underway. 
These facts provide a great opportunity for in-depth 
mining of metagenomic data and exploration of 
novel gene candidates useful under a variety of different 
scenarios. For example, the metagenomic data sets from 
10 diverse sources were used to identify several novel 
candidates for commercially useful enzymes (CUEs) 
[100]. A catalogue of 510 CUEs was prepared using 
literature search followed by manual curation, and then 
the catalogue was used to find homologues in the 
metagenomic data sets. High-throughput functional 
metagenomic screening may be used to look for the 
presence of CUEs and other specific enzymes of inter- 
est in the metagenomes [101]. In another study, the 
recruitment of genomes from pathogens against the 
metagenomes of healthy individuals containing 
commensal strains of the same species was used to iden- 
tify the genomic regions of individual bacterial isolates 
missing in the metagenomes [102]. These regions are 
referred to as metagenomic islands and are found to 
harbor several virulence-related genes specific to the 
pathogenic strain. 
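
The idea of locating genomic regions that recruit no metagenomic reads can be sketched as an interval problem. The example below is an illustration only and does not reproduce the criteria of the cited study [102]: it assumes that read alignments against the isolate genome are already available as 0-based, half-open (start, end) intervals, and the minimum region length is a hypothetical threshold.

```python
def uncovered_regions(genome_len, hits, min_len=1):
    """Return genome intervals with no recruited reads.

    genome_len: length of the reference (isolate) genome.
    hits: iterable of (start, end) alignment intervals, 0-based, half-open.
    min_len: smallest gap reported (hypothetical threshold).
    """
    gaps = []
    pos = 0  # end of the region covered so far
    for start, end in sorted(hits):
        if start - pos >= min_len:
            gaps.append((pos, start))
        pos = max(pos, end)
    if genome_len - pos >= min_len:
        gaps.append((pos, genome_len))
    return gaps

# Toy example: a 1000 bp genome with reads covering 0-400 and 700-900.
islands = uncovered_regions(1000, [(0, 200), (150, 400), (700, 900)], min_len=100)
print(islands)
```

Regions reported in this way are only candidates; whether they carry virulence-related genes still has to be established by annotating the genes they contain.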

CONCLUSIONS 

Metagenomic sequencing provides a unique oppor- 
tunity to explore yet unknown environments in 
great detail. Functional analysis of the metagenomic 
data plays a central role in such studies by providing 
important clues about functional and metabolic 
diversity, as well as variation. While metagenomic 
studies continue to suffer from certain caveats that 
make the downstream data analysis a challenging task 
for bioinformaticians, the gradual improvement in 
metagenomic technologies and development of 
tools and resources that account for the known problems
will relieve some of the burdens. For example,
the use of next-generation sequencers producing 
longer read-lengths (>300 bp) will usually lead to
better sequence coverage. This can then be followed 
by the use of sequence assembly and gene prediction 
tools and parameters specifically developed for meta- 
genomic sequences which will further help in im- 
proving assembly and gene prediction efficiency, 
respectively, and will result in a greater number of 
complete predicted proteins. Better functional as- 
signments for metagenomic data sets can be obtained 
by using more complete proteins. However, while 
comparing the abundance profiles of functions be- 
tween communities, the frequencies of the functions 
should not be masked by the assembly, and the read 
depths of the contigs should be accounted for. 
Another common problem that is usually encoun- 
tered in metagenomic data functional analysis is the 
long computational time that is required for 
BLAST-based homology searches for orthologs. 
The use of alternative search algorithms, such as 
BLAT, can provide analysis results in shorter times; 
however, the loss of sensitivity by BLAT-based 
searches should be taken into account when analyz- 
ing the results. Alternatively, profile-based search 
methods using the HMMER algorithm may also 
be used whenever pre-computed sequence profiles 
are available. Certain issues, including large volumes 
of metagenomic sequence data, large storage require- 
ments for the analyzed data, and the typically large 
number of unknown sequences in the metagenomic 
data still pose serious challenges for its analysis. 
Therefore, there is great need for the development 
of new, faster, more sensitive tools and more thor- 
ough resources dedicated to the functional analysis of 
metagenomic data sets. Also, it is strongly advised 
that when analyzing the data, one must be aware 
of any additional factors that can influence the func- 
tional analysis, including sample preparation, sequen- 
cing method, diversity of the environments, etc. 
Proper calibrations, normalizations and statistical 
tests for significance should always be performed in 
order to arrive at the most reliable conclusions. 
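
The recommendation to account for contig read depth when comparing functional abundance profiles can be illustrated with a minimal sketch. It assumes that each predicted gene has been assigned to a contig and a function, and that a mean read depth is known for every contig. The function name, the function labels and the depths are hypothetical.

```python
from collections import defaultdict

def function_abundance(genes, contig_depth):
    """Relative abundance of each function, weighted by contig read depth.

    genes: iterable of (contig_id, function_label) pairs, one per gene.
    contig_depth: dict mapping contig_id to its mean read depth.
    """
    totals = defaultdict(float)
    for contig, function in genes:
        # Each gene contributes the depth of its contig, not a single count.
        totals[function] += contig_depth[contig]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {f: depth / grand_total for f, depth in totals.items()}

# Toy assembly: contig c1 is covered three times more deeply than c2.
genes = [("c1", "func_A"), ("c2", "func_B"), ("c2", "func_A")]
depth = {"c1": 30.0, "c2": 10.0}
profile = function_abundance(genes, depth)
print(profile)
```

Counting genes without weighting would give func_A two-thirds of the profile, whereas the depth-weighted estimate is 0.8, showing how assembly can mask the underlying frequencies.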

DNA sequence-based metagenomic functional ana- 
lysis is limited in that it only provides information about 
the functional content of an environment. Thus, it may 
be complemented by other independent approaches 
that help to gain further insights about the more dy- 
namic aspects of a given community. For example, a 
few metatranscriptomic projects have been undertaken 
to address which genes are actually being expressed in 
different environments and to what extent [103, 104]. 



Given that proteins are much more stable than
mRNAs [105], a proteome-based analysis is expected
to provide a more accurate view of the functionality of
a given environment. Toward this, a few metaproteomic
studies have been conducted to explore which
protein products are formed and how they are involved
in the cross-talk within the environment under different
conditions [106-109]. The metabolome, which
represents the complete set of small molecules in an 
organism, can influence gene expression and protein 
function. Therefore, metabolomics also plays a key role 
in understanding cellular systems and decoding the 
functions of genes [110, 111]. A few metabolomic 
analyses have been conducted to determine which 
metabolites are produced as a result of the underlying 
metabolic pathways that are being exerted in a given 
community and to study host-microbe interactions 
[112-117]. Another alternative to the DNA-based 
studies used for determining microbial community 
composition, metalipidomics, is being implemented 
mainly to identify the living microbial cells in an 
environment [118]. Intact polar lipids (IPLs), which 
are the basic building blocks of biomembranes, are 
ubiquitous in nature and have several characteristics 
that make them useful as proxies for living microbial
cells. To date, metabolomic studies have not been
directly used for the functional analysis of environments.
However, studies seeking to identify microbes
of specific functional interest may be conducted, as has
been done for ammonia-oxidizing microbes from 
marine and estuarine sediments [119]. The functional 
component of the environment may then be exten- 
sively analyzed using different approaches to gain more 
insights about the cross-talk taking place in that 
environment. Thus, the application of metalipidomics 
to study host-associated microbial composition and
functional analysis, while not yet explored, appears 
promising. 



KEY POINTS 

• Read-lengths generated during metagenomic sequencing 
influence assembly, gene prediction and eventually functional 
analysis. The enormous volume of sequence data, which leads to 
long computational times and massive storage requirements, 
also impedes metagenomic functional prediction. 

• Factors that potentially influence functional analysis of metagenomic
data, including sample preparation, sequencing method,
average genome size, etc., should be considered prior to analysis.

• A higher fraction of metagenomic sequences are annotated 
using BLAST against data-rich reference sequence databases 
such as NCBI NR as compared to SwissProt, COGs, KEGG, etc. 

• Integrated methods using more than one approach can improve 
the efficiency and reliability of functional predictions. 

• DNA-sequence-based metagenomic functional analysis should
be complemented with other types of approaches, such as
metatranscriptomics, metaproteomics, metabolomics and
metalipidomics, to gain better insights into the dynamics of a community